Abstract

Complex networks describe a wide range of systems in nature and society. To understand complex networks, it is crucial to
investigate their community structure. In this paper, we develop an online community detection algorithm with linear time
complexity for large complex networks. Our algorithm processes a network edge by edge in the order in which the network is
fed to the algorithm. When a new edge is added, it updates the existing community structure in constant time and does not
need to re-compute the whole network. Therefore, it can efficiently process large networks in real time. Our algorithm
optimizes expected modularity instead of modularity at each step to avoid poor performance. The experiments are carried
out using 11 public data sets, and are measured by two criteria, modularity and NMI (Normalized Mutual Information). The
results show that our algorithm's running time is less than that of the commonly used Louvain algorithm while it gives
competitive performance.

Introduction

Complex networks describe a wide range of systems in nature 
and society [1-3]. Frequently cited examples include the Internet 
in which routers and computers are connected by physical links, 
and collaboration networks in which researchers are linked by 
coauthoring. To understand the formation, evolution, and 
function of complex networks, it is crucial to investigate their 
community structure, not only for uncovering the relations 
between internal structure and functions, but also for practical 
applications in many disciplines such as biology and sociology [4- 
6]. 

Intuitively, a community of a complex network consists of a 
cohesive group of nodes that are relatively densely connected to 
each other but sparsely connected to other dense groups in the 
network [7]. Community detection aims to identify the commu- 
nities by only using the information encoded in the network 
topology [8] . It is one of the critical issues in the study of complex 
networks. A wide variety of community detection methods have 
been developed to serve different scientific needs [8,9] . 

Modularity is a commonly used criterion for community 
detection. It was first proposed in Newman et al. [10]. Good et
al. [11] describe the performance of modularity maximization in
practical contexts and present a broad characterization of its 
performance in such situations. A wide variety of algorithms for 
solving the modularity optimization problem have been developed 
[12]. For example, Clauset et al. [13] present a hierarchical 
agglomeration algorithm for detecting communities. Newman et 
al. [14] show that the modularity can be expressed in terms of the 
eigenvectors of a characteristic matrix for the network. This 
expression leads to a spectral algorithm for community detection. 



Modularity can be generalized in a principled fashion to 
incorporate the edge information such as direction and weight. 
Leicht et al. [15] consider the problem of finding communities in 
directed networks. Newman et al. [16] point out that weighted 
networks can, in many cases, be analyzed using a simple mapping 
from a weighted network to an unweighted multigraph. Lancichi- 
netti et al. [9] generate directed and weighted networks with built- 
in community structure and show how modularity optimization 
performs on their benchmark. However, Fortunato et al. [17] find 
that modularity optimization may fail to identify communities 
smaller than a scale which depends on the total size of the network 
and on the degree of interconnectedness of the communities, 
which is called a resolution problem. To mitigate the resolution 
issue, Reichardt et al. [18] show how community detection can be
interpreted as finding the ground state of an infinite range spin 
glass. Ruan et al. [19] propose a recursive algorithm HQCUT to 
solve the resolution limit problem. Arenas et al. [20] propose a 
method that allows for multiple resolution screening of modular 
structures. Aldecoa et al. [21] introduce a criterion called
"Surprise" to resolve the resolution problem. 

In some kinds of complex networks, new edges continually 
appear while old edges do not disappear, resulting in a large 
network. For example, citation networks are growing as new 
papers cite existing papers. To efficiently process these kinds of
networks, we desire a community detection algorithm that can
process a network (1) without recomputing the whole network
after every new edge/node and (2) without needing the whole
network structure to be available at each update. Although many
community detection algorithms have been proposed, to the best
of our knowledge, no algorithm meets these two
requirements. Many existing algorithms need to start from the
beginning when the network is expanded, even when only one
node or edge is added.

Many efforts have been made to meet the two requirements.
Leung et al. [22] identified novel characteristics and drawbacks
of the label propagation algorithm, and extended it by
incorporating different heuristics to facilitate reliable and
multi-functional real-time community detection. Huang et al. [23]
introduced a new quality function for local communities, and
presented a fast local expansion algorithm for uncovering
communities in large-scale networks. Kawadia et al. [24]
presented a new measure of partition distance called
estrangement, and showed that constraining estrangement makes it
possible to find meaningful temporal communities in diverse
real-world data sets. However, neither Leung's algorithm nor
Huang's algorithm can handle growing networks, since they must
recompute the whole network after every new edge/node.
Kawadia's algorithm requires the whole network structure to
be available at each update.

In this paper, we develop a community detection algorithm 
to meet the two requirements. Our algorithm is an online 
algorithm, i.e. it can process a network edge by edge in the 
order that the network is fed to the algorithm, without having 
the whole network available from the start. Our algorithm 
updates existing community structure in constant time once a 
new edge is added. The update avoids re-processing the whole 
network structure, since it only needs knowledge about a 
network's local structure related to the new edge, thus our 
algorithm can efficiently process large networks in real time. 
Our algorithm has O(M) time complexity and O(NK) space
complexity, where M is the number of edges, N is the number of
nodes, and K is the number of communities.

This paper is an extension of our previous work [25] published
in IJCAI'13 (available for free at
http://ijcai.org/papers13/Papers/IJCAI13-281.pdf). The main differences are three-fold: (1)
This paper proposes a generative model for complex network 
based on preferential attachment mechanism, which helps us to 
infer network's future structure by its current structure and gives a 
solid theoretical motivation to the algorithm; (2) This paper 
develops a deterministic online community detection algorithm, 
which uses expected modularity to make an informed choice. The 
conference paper's non-deterministic algorithm may need many 
runs; (3) This paper uses additional datasets and extensive 
experiments for more convincing results. 

Method 

To achieve online community detection, we first propose a
generative model for complex networks based on the preferential 
attachment mechanism [26,27], which helps us to predict a 
network's future structure based on its current structure. We then 
develop an online community detection algorithm, which 
processes a network edge by edge. It optimizes expected 
modularity instead of modularity to avoid poor performance in 
some specific cases. Expected modularity can be calculated based 
on our generative model. 

Preliminaries 

A network $G = \{V, E\}$ is a set of $N$ nodes $V = \{v_1, \ldots, v_N\}$
connected by a set of $M$ edges $E = \{e_{ij} = \{v_i, v_j\}\}$. The network
considered here is undirected, unweighted, and without self-loops
or isolated nodes. Let $P = \{C_1, \ldots, C_K\}$ denote a partition of $V$. It
is a division of $V$ into $K$ non-overlapping and non-empty
communities $C_k$ that cover all of $V$. As a performance measure



for the partition quality, modularity was first proposed by 
Newman et al. [28] . It can be expressed as 



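$$q(P) = \sum_{C_k \in P} \left[ \frac{\mathrm{edg}(C_k)}{|E|} - \left( \frac{\mathrm{deg}(C_k)}{2|E|} \right)^2 \right] \qquad (1)$$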



where $\mathrm{edg}(C_k) = |\{e_{ij} \mid v_i \in C_k \text{ and } v_j \in C_k\}|$ is the number of
intra-community edges within community $C_k$, $|E|$ is the number of
edges within network $G$, and $\mathrm{deg}(C_k)$ is the degree of community
$C_k$, defined as $\mathrm{deg}(C_k) = \sum_{v_i \in C_k} \mathrm{deg}(v_i)$, where $\mathrm{deg}(v_i)$ is the
degree of node $v_i$. Hence community detection can be formulated
as a modularity optimization problem

$$\max_{P} \; q(P)$$

and Brandes et al. [29] prove the conjectured hardness of this 
problem both in the general case and in the case with restriction to 
number of partitions K. This result makes heuristic techniques the 
only viable option for modularity optimization problem. However, 
heuristic techniques cannot guarantee that the partition is good 
enough. It may result in a poor partition for some networks. In 
other words, the algorithms fail to achieve an acceptable 
modularity. We say an algorithm encounters failure if all nodes 
are assigned to the same community. 
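
As a concrete illustration of Eq. (1), the following minimal Python sketch computes the modularity of a given partition from an edge list. The function name `modularity` and the dictionary `community_of` (node to community label) are names chosen here for illustration and are not part of the original work.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Eq. (1): sum over communities of edg(C)/|E| - (deg(C)/(2|E|))^2."""
    num_edges = len(edges)
    if num_edges == 0:
        return 0.0
    intra = defaultdict(int)   # edg(C_k): edges with both endpoints in C_k
    degree = defaultdict(int)  # deg(C_k): sum of node degrees in C_k
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(
        intra[c] / num_edges - (degree[c] / (2 * num_edges)) ** 2
        for c in degree
    )


# Two triangles joined by one bridge edge (7 edges in total).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0, the "failure" case described above
```

Placing every node in a single community gives a modularity of exactly zero, which is the failure condition defined in the text.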

Generative Model for Complex Network 

Complex networks have non-trivial topological features that do 
not occur in some simple networks but often occur in real 
networks. An important feature of many complex networks is that 
their degree distributions follow a particular mathematical 
function called the power law [27,30,31], although it does not 
always hold [32]. The power law implies that the degree 
distribution of the network has no characteristic scale. 

The work of Price et al. [26] is widely recognized as seminal in
presenting a model for the observed stationary scale-free
distributions of complex networks. Barabasi et al. [27] conclude that
this feature is a consequence of two generic mechanisms: (1)
networks expand continuously by the addition of new nodes; (2)
new nodes attach preferentially to nodes that are already
well connected. Barabasi's model is widely accepted
[33,34]. Specifically, a new node $v_j$ will attach to an existing node
$v_i$ with probability $p(v_i)$ in proportion to the degree of node $v_i$

$$p(v_i) \propto \mathrm{deg}(v_i). \qquad (2)$$



The above model only considers the case that a new edge links a 
new node to an existing node. However, a new edge may link two 
existing nodes or two new nodes. In fact, estimating the likelihood 
of the appearance of a new edge between two existing nodes, 
called link prediction, is one of the fundamental problems in 
network analysis. A variant of preferential attachment mechanism 
can be used to do link prediction [35]. Specifically, a new edge will
link two existing nodes $v_i$ and $v_j$ with probability $p(v_i, v_j)$ in
proportion to the product of the degree of node $v_i$ and the degree
of node $v_j$

$$p(v_i, v_j) \propto \mathrm{deg}(v_i)\,\mathrm{deg}(v_j). \qquad (3)$$

For a complete review of the statistical mechanics of network
topology and dynamics of complex networks, one can refer to
Boccaletti et al. [34] or Albert et al. [36]. Mitzenmacher et al. [37]
briefly surveyed some other generative models that lead to
scale-free distributions. For a summary of recent progress on link
prediction algorithms, one can refer to Lu et al. [38].

To facilitate subsequent work, we generalize the preferential
attachment mechanism from nodes to communities. A new node will
attach to an existing community $C_k$ with probability $p(C_k)$ in
proportion to the degree of community $C_k$

$$p(C_k) \propto \mathrm{deg}(C_k)$$

and a new edge will link two existing communities $C_{k_1}$ and $C_{k_2}$
with probability $p(C_{k_1}, C_{k_2})$ in proportion to the product of the
degree of community $C_{k_1}$ and the degree of community $C_{k_2}$

$$p(C_{k_1}, C_{k_2}) \propto \mathrm{deg}(C_{k_1})\,\mathrm{deg}(C_{k_2}).$$

Here we propose a generative model for complex networks. Our
model generates a network $G$ with $M$ edges by the addition of new
edges. It starts from an empty network $G_0 = \emptyset$. For
$m = 0, \ldots, M-1$, there are three cases for a new edge
$e_{m+1} = \{v_i, v_j\}$ to be added to network $G_m = \{V_m, E_m\}$, namely:

Case (a): link a new node to an existing node,
$\{v_i, v_j\} \cap V_m = \{v_i\}$ or $\{v_i, v_j\} \cap V_m = \{v_j\}$, with probability $p_a$;

Case (b): link two existing nodes, $\{v_i, v_j\} \subseteq V_m$, with
probability $p_b$;

Case (c): link two new nodes, $\{v_i, v_j\} \cap V_m = \emptyset$, with
probability $p_c$.

For cases (a) and (b), the addition of the new edge follows the
preferential attachment mechanism mentioned above (See Fig. 1).




Figure 1. Three cases for a new edge to be added in an existing
network. (a) linking a new node to an existing node; (b) linking two existing
nodes; (c) linking two new nodes.
doi:10.1371/journal.pone.0102799.g001



Notice that $p_a + p_b + p_c = 1$. When $p_a = 1$, our model is the same
as Barabasi's model for growing networks.
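
The three cases can be simulated with a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name `generate_network`, the `seed` parameter, the rule that the very first edge is forced to be case (c) (because no node exists yet), and the choice to redraw when case (b) would produce a self-loop are all introduced here and are not taken from the original model description. Repeated edges between the same pair of nodes are not filtered out.

```python
import random


def generate_network(num_edges, p_a, p_b, p_c, seed=0):
    """Grow an edge list using cases (a), (b), (c) with degree-proportional choices."""
    assert abs(p_a + p_b + p_c - 1.0) < 1e-9
    rng = random.Random(seed)
    edges = []
    # Node v appears deg(v) times in this list, so a uniform choice from it
    # selects a node with probability proportional to its degree.
    endpoints = []
    next_node = 0
    for _ in range(num_edges):
        r = rng.random()
        if not endpoints or r >= p_a + p_b:
            # Case (c): two new nodes (also forced while the network is empty).
            u, v = next_node, next_node + 1
            next_node += 2
        elif r < p_a:
            # Case (a): a new node attaches to an existing node, Eq. (2).
            u, v = rng.choice(endpoints), next_node
            next_node += 1
        else:
            # Case (b): two distinct existing nodes, Eq. (3); redraw on a self-loop.
            u = v = rng.choice(endpoints)
            while v == u:
                v = rng.choice(endpoints)
        edges.append((u, v))
        endpoints.extend((u, v))
    return edges


mixed = generate_network(200, p_a=0.6, p_b=0.3, p_c=0.1, seed=1)
tree = generate_network(50, p_a=1.0, p_b=0.0, p_c=0.0, seed=2)
```

With $p_a = 1$ every edge after the first adds exactly one node, so `tree` has 51 nodes and 50 edges, matching the growing-network special case noted above.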

Online Community Detection Algorithm 

A straightforward way to do online community detection is to
take a sequence of edges as input and optimize modularity
$q(P_{m+1})$ at each step for the current network $G_{m+1}$ based on the previous
partition $P_m$. However, this greedy algorithm may perform
poorly. Consider Barabasi's model, in which every new edge
links a new node to an existing node. Brandes et al. [29] prove that
a partition with maximum modularity has no community that
consists of a single node with degree one, so each new node should
be assigned to an existing community. However, repeating this operation
places all nodes in the same community and results in zero
modularity.

To avoid poor performance, our algorithm optimizes the expected
modularity $E[q(P_M)]$ for the final network $G_M$, instead of the modularity
$q(P_{m+1})$ for the current network $G_{m+1}$, at each step. We calculate
$E[q(P_M)]$ based on our generative model and the following
partition: existing nodes are kept in their current
communities; new nodes are assigned to existing
communities so that the degree of every existing
community (defined as the sum of the degrees of the nodes that belong to
that community) keeps increasing and the expected increment of the
degree of a community is proportional to the degree of that community.
Such a partition keeps the subsequent derivation of expected
modularity simple.

First we calculate $q(P_{m+1})$. Notice that, by Eq. (1), $\sum_{C_{k,m} \in P_m} \mathrm{edg}(C_{k,m})$ can
be expressed as

$$\sum_{C_{k,m} \in P_m} \mathrm{edg}(C_{k,m}) = |E_m|\, q(P_m) + \frac{1}{4|E_m|} \sum_{C_{k,m} \in P_m} \mathrm{deg}(C_{k,m})^2$$

where $C_{k,m}$ is community $C_k$ at step $m$ and $|E_m|$ is the number of
edges within network $G_m$. $|E_m|$ is always equal to $m$, as our
algorithm processes one edge at each step. Hence $q(P_{m+1})$ can be
expressed as

$$
\begin{aligned}
q(P_{m+1}) &= \sum_{C_{k,m+1} \in P_{m+1}} \left[ \frac{\mathrm{edg}(C_{k,m+1})}{|E_{m+1}|} - \left( \frac{\mathrm{deg}(C_{k,m+1})}{2|E_{m+1}|} \right)^2 \right] \\
&= \frac{m}{m+1}\, q(P_m) + \frac{1}{m+1} \left[ \sum \mathrm{edg}(C_{k,m+1}) - \sum \mathrm{edg}(C_{k,m}) \right] \\
&\quad + \frac{1}{4m(m+1)^2} \sum \mathrm{deg}(C_{k,m})^2 - \frac{1}{4(m+1)^2} \left[ \sum \mathrm{deg}(C_{k,m+1})^2 - \sum \mathrm{deg}(C_{k,m})^2 \right]
\end{aligned}
\qquad (4)
$$

Figure 2. Two operations to process a new edge linking a new node to an existing node. (a) A new node attaches to an existing node with
degree two; it joins the same community as the existing node. (b) Another new node attaches to the previous new node with degree one; it splits off as
a new community.
doi:10.1371/journal.pone.0102799.g002



Then we calculate $E[q(P_{m+1})]$ under the three cases separately as
follows:

Case (a): link a new node to an existing node. Without loss of
generality, we assume $v_i$ is the existing node and $v_j$ is the new
node. We assign the new node to the same community as the
existing node and have

$$
\begin{aligned}
E[q(P_{m+1})] &= \sum_{C_{k(i),m} \in P_m} q(P_{m+1})\, p(C_{k(i),m}) = \sum_{C_{k(i),m} \in P_m} q(P_{m+1})\, \frac{\mathrm{deg}(C_{k(i),m})}{2m} \\
&= \frac{m}{m+1}\, q(P_m) + \frac{m}{(m+1)^2} - \frac{1}{4m(m+1)^2} \sum_{C_{k,m} \in P_m} \mathrm{deg}(C_{k,m})^2
\end{aligned}
$$

where $C_{k(i)}$ is the community which node $v_i$ belongs to.

Case (b): link two existing nodes. We do not change the
partition and have

$$E[q(P_{m+1})] = \frac{m}{m+1}\, q(P_m) + \frac{1}{8m^2(m+1)^2} \sum_{C_{k,m} \in P_m} \mathrm{deg}(C_{k,m})^2 - \frac{1}{2(m+1)^2}.$$

Case (c): link two new nodes. We assign the two new nodes to an
existing community with probability in proportion to the degree of
the existing community. Case (c)'s $q(P_{m+1})$ and $E[q(P_{m+1})]$ are
the same as case (a)'s.

Figure 3. Two situations of a new edge linking two existing nodes. (a) Nodes belong to the same community; (b) Nodes belong to different
communities.
doi:10.1371/journal.pone.0102799.g003




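Finally we calculate $E[q(P_M)]$ by combining $E[q(P_{m+1})]$ under the
three cases and applying it iteratively

$$
\begin{aligned}
E[q(P_M)] &= \frac{m+1}{M}\, q(P_{m+1}) - \frac{1 - p_b}{4M} \sum_{m < m' < M} \frac{1}{m'(m'+1)}\, E\Big[ \sum_{k} \mathrm{deg}(C_{k,m'})^2 \Big] \\
&\quad + \frac{p_b}{8M} \sum_{m < m' < M} \frac{1}{m'^2(m'+1)}\, E\Big[ \sum_{k} \mathrm{deg}(C_{k,m'})^2 \Big] + f(M,m)
\end{aligned}
\qquad (5)
$$

where $f(M,m)$ only depends on $M$ and $m$.

As our partition keeps the degree of every existing community
increasing, we have

$$\mathrm{deg}(C_{k,m+1}) \le \mathrm{deg}(C_{k,m'}) \le \mathrm{deg}(C_{k,m+1}) + 2\,(m' - (m+1))$$

and the expected increment of the degree of a community is
proportional to the degree of that community, thus the expected degree
of community $C_k$ at step $m'$ can be expressed as

$$E[\mathrm{deg}(C_{k,m'})] = \frac{m'}{m+1}\, \mathrm{deg}(C_{k,m+1}).$$

According to the Popoviciu inequality on variances, the variance
of $\mathrm{deg}(C_{k,m'})$ has a loose upper bound

$$\mathrm{Var}[\mathrm{deg}(C_{k,m'})] \le [m' - (m+1)]^2.$$

So we have

$$\sum_{k} E[\mathrm{deg}(C_{k,m'})]^2 \le E\Big[ \sum_{k} \mathrm{deg}(C_{k,m'})^2 \Big] \le \sum_{k} E[\mathrm{deg}(C_{k,m'})]^2 + K_m\, [m' - (m+1)]^2$$

where $K_m$ is the number of communities within network $G_m$ and

$$
\begin{aligned}
E[q(P_M)] &= \frac{m+1}{M}\, q(P_{m+1}) + \frac{(2p_b - 2)(M - m - 1) + (2 - p_b)(\ln M - \ln(m+1))}{8M(m+1)^2} \sum_{C_{k,m+1} \in P_{m+1}} \mathrm{deg}(C_{k,m+1})^2 \\
&\quad + f(M,m) + O\Big[ K_m \Big( \frac{M - m}{M} \Big)^2 \Big].
\end{aligned}
\qquad (6)
$$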
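
The step from the variance bound to a bound on the second moment of the community degree can be spelled out as follows. This is a worked sketch that assumes Popoviciu's inequality is applied with the lower and upper limits of $\mathrm{deg}(C_{k,m'})$ stated in the text; the shorthand $X$, $a$ and $b$ is introduced here only for readability.

```latex
% Popoviciu's inequality: if a <= X <= b then Var[X] <= (b - a)^2 / 4.
% Shorthand (introduced for this sketch):
%   X = deg(C_{k,m'}),  a = deg(C_{k,m+1}),  b = a + 2 (m' - (m+1)).
\begin{align*}
\mathrm{Var}[X] &\le \frac{(b-a)^2}{4}
   = \frac{\bigl(2\,(m' - (m+1))\bigr)^2}{4}
   = \bigl[m' - (m+1)\bigr]^2, \\
E[X^2] &= E[X]^2 + \mathrm{Var}[X]
   \le \Bigl(\frac{m'}{m+1}\Bigr)^{2} \mathrm{deg}(C_{k,m+1})^2
   + \bigl[m' - (m+1)\bigr]^2 .
\end{align*}
% Since Var[X] >= 0, E[X]^2 <= E[X^2] also holds, which gives a two-sided
% bound per community; summing over K_m communities adds at most
% K_m [m' - (m+1)]^2 to the sum of squared expected degrees.
```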



Now we describe the online community detection algorithm.
For the initial network $G_0 = \emptyset$, it is clear that the best partition $P_0$ is an
empty set too. For subsequent networks $G_{m+1}$,
$m = 0, 1, 2, \ldots, M-1$, we consider some candidate operations
which update the partition. Each operation has its corresponding
$E[q(P_M)]$. We take the operation which has the largest $E[q(P_M)]$.
In fact, we only need to know the expected modularity gain
$\Delta E[q(P_M)]$, which is defined as $E[q(P_M)]$ of one operation minus
$E[q(P_M)]$ of another. Up to the error term, it is

$$\Delta E[q(P_M)] = \frac{m+1}{M}\, \Delta q(P_{m+1}) + \frac{(2p_b - 2)(M - m - 1) + (2 - p_b)(\ln M - \ln(m+1))}{8M(m+1)^2}\, \Delta \sum_{C_{k,m+1} \in P_{m+1}} \mathrm{deg}(C_{k,m+1})^2$$

Table 1. Summary of network data sets.

Data set        Number of nodes    Number of edges
ca-CondMat               23,133             93,439
ca-HepPh                 12,008            118,489
email-Enron              36,692            183,831
ca-AstroPh               18,772            198,050
cit-HepTh                27,770            352,285
cit-HepPh                34,546            420,877
com-Amazon              334,863            925,872
com-DBLP                317,080          1,049,866
web-Stanford            281,903          1,992,636
Amazon0601              403,394          2,443,408
WikiTalk              2,394,385          4,659,565

doi:10.1371/journal.pone.0102799.t001

Figure 4. The evolution of temporal modularity over time by OLEM and OLTM.
doi:10.1371/journal.pone.0102799.g004

Table 2. Modularity by three community detection algorithms.

Data set        OLEM      OLTM      Louvain
ca-CondMat      0.6446    0.6585    0.7288
ca-HepPh        0.5734    0.6052    0.6549
email-Enron     0.5447    0.0464    0.5876
ca-AstroPh      0.5418    0.5523    0.6149
cit-HepTh       0.5885    0.6146    0.6571
cit-HepPh       0.6278    0.6771    0.7228
com-Amazon      0.7050    0.7057    0.9256
com-DBLP        0.7252    0.7335    0.8091
web-Stanford    0.8377    0.8702    0.9256
Amazon0601      0.7785    0.4533    0.8670
WikiTalk        0.5344    0.0897    0.5831

doi:10.1371/journal.pone.0102799.t002

We describe our operations under the three cases separately as
follows:

Case (a): link a new node to an existing node. We consider two
operations: the Split operation where the new node splits off as a new
community, and the Join operation where the new node joins the
same community as the existing node (See Fig. 2). Without loss of
generality, we assume $v_i$ is the existing node and $v_j$ is the new
node.

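For the Split operation, we have

$$q_{Split}(P_{m+1}) = \frac{m}{m+1}\, q(P_m) + \frac{1}{4m(m+1)^2} \sum_{C_{k,m} \in P_m} \mathrm{deg}(C_{k,m})^2 - \frac{\mathrm{deg}(C_{k(i),m}) + 1}{2(m+1)^2}.$$

The existing community $C_{k(i)}$ has degree
$\mathrm{deg}(C_{k(i),m+1}) = \mathrm{deg}(C_{k(i),m}) + 1$ and the new community $C_{K+1}$
has degree $\mathrm{deg}(C_{K+1,m+1}) = 1$ at step $m+1$.

For the Join operation, we have

$$q_{Join}(P_{m+1}) = \frac{m}{m+1}\, q(P_m) + \frac{1}{4m(m+1)^2} \sum_{C_{k,m} \in P_m} \mathrm{deg}(C_{k,m})^2 - \frac{\mathrm{deg}(C_{k(i),m}) - m}{(m+1)^2}.$$

The existing community $C_{k(i)}$ has degree
$\mathrm{deg}(C_{k(i),m+1}) = \mathrm{deg}(C_{k(i),m}) + 2$ at step $m+1$.
Then we have

$$\Delta q(P_{m+1}) = q_{Split}(P_{m+1}) - q_{Join}(P_{m+1}) = \frac{\mathrm{deg}(C_{k(i),m}) - 2m - 1}{2(m+1)^2}$$

and

$$\Delta \sum_{C_{k,m+1} \in P_{m+1}} \mathrm{deg}(C_{k,m+1})^2 = -2\, \mathrm{deg}(C_{k(i),m}) - 2.$$

We estimate $p_b$ by the observed frequency of case (b). Taking these
together and omitting the error term, we can obtain $\Delta E[q(P_M)]$,
and take the Split operation if it is positive or the Join operation
otherwise.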
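
The closed form for the Split-versus-Join gain in temporal modularity, $(\mathrm{deg}(C_{k(i),m}) - 2m - 1) / (2(m+1)^2)$, can be checked against a direct evaluation of Eq. (1) on a tiny example. The sketch below uses exact rational arithmetic; the helper names `modularity_exact` and `delta_q_split_minus_join` and the three-node path example are introduced here for illustration only.

```python
from fractions import Fraction


def modularity_exact(edges, community_of):
    """Eq. (1) evaluated with exact fractions."""
    num_edges = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        Fraction(intra.get(c, 0), num_edges) - Fraction(degree[c], 2 * num_edges) ** 2
        for c in degree
    )


def delta_q_split_minus_join(deg_c, m):
    """Closed form: (deg(C_k(i),m) - 2m - 1) / (2 (m+1)^2)."""
    return Fraction(deg_c - 2 * m - 1, 2 * (m + 1) ** 2)


# Path 0-1-2 in one community: m = 2 edges, community degree 4.
# A new node 3 then attaches to node 2.
new_edges = [(0, 1), (1, 2), (2, 3)]
q_split = modularity_exact(new_edges, {0: "A", 1: "A", 2: "A", 3: "B"})
q_join = modularity_exact(new_edges, {0: "A", 1: "A", 2: "A", 3: "A"})
direct = q_split - q_join
closed_form = delta_q_split_minus_join(deg_c=4, m=2)
print(direct, closed_form)  # both -1/18
```

Because a community degree can never exceed the total degree $2m$, this temporal gain is always negative, so a purely greedy rule on $q(P_{m+1})$ would always choose Join in case (a). The expected-modularity gain adds the second term involving $\Delta \sum \mathrm{deg}(C_{k,m+1})^2$, which is what allows a Split to be chosen.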

Table 3. Average running time (in seconds) over 10 runs by three community detection algorithms.

Data set        OLEM       OLTM       Louvain
ca-CondMat       0.3120     0.2683     0.5029
ca-HepPh         0.3344     0.2387     0.3872
email-Enron      0.6396     0.4040     0.7004
ca-AstroPh       0.6334     0.5257     0.5806
cit-HepTh        1.0972     0.8502     1.0353
cit-HepPh        1.3736     1.1576     1.1707
com-Amazon       5.0592     4.7429     6.6453
com-DBLP         4.6914     4.3096     6.9377
web-Stanford     5.8879     5.1023    32.0137
Amazon0601       9.8601     8.1857    12.7132
WikiTalk        25.0043    17.6910    27.0956

doi:10.1371/journal.pone.0102799.t003

Table 4. Number of communities by three community detection algorithms and Yang's labeled community structure.

Data set    OLEM     OLTM     Louvain    Labeled
Amazon      1,988    1,979    217        5,000
DBLP        4,122    3,854    301        5,000

doi:10.1371/journal.pone.0102799.t004

Case (b): link two existing nodes. The two existing nodes may or
may not belong to the same community (See Fig. 3). If both nodes
belong to the same community, it is hard to propose a suitable
candidate operation, so we take the Dense operation where we
keep the current partition unchanged. Otherwise we consider two
operations: (1) the Move operation where we move one node from
its community to the other node's community; (2) the Keep
operation where we keep the current partition unchanged.
Without loss of generality, we assume $v_i$ is the moving node and
have

$$
\begin{aligned}
\Delta q(P_{m+1}) &= q_{Move}(P_{m+1}) - q_{Keep}(P_{m+1}) \\
&= \frac{\mathrm{deg}(v_i, C_{k(j),m}) - \mathrm{deg}(v_i, C_{k(i),m}) + 1}{m+1}
 - \frac{(\mathrm{deg}(v_i) + 1)\,(\mathrm{deg}(C_{k(j),m}) - \mathrm{deg}(C_{k(i),m}) + \mathrm{deg}(v_i) + 1)}{2(m+1)^2}
\end{aligned}
$$

where $\mathrm{deg}(v_i, C_{k,m}) = |\{e_{ij} \mid v_j \in C_{k,m}\}|$ is the number of edges from
node $v_i$ to community $C_k$ at step $m$ and

$$\Delta \sum_{C_{k,m+1} \in P_{m+1}} \mathrm{deg}(C_{k,m+1})^2 = 2\,(\mathrm{deg}(v_i) + 1)\,(\mathrm{deg}(C_{k(j),m}) - \mathrm{deg}(C_{k(i),m}) + \mathrm{deg}(v_i) + 1).$$

Therefore, we obtain $\Delta E[q(P_M)]$ and determine the operation
in the same way as we do in case (a).

Case (c): link two new nodes. We consider two operations: the
New operation where we assign the two new nodes to a new
community and the Merge operation where we assign them to
an existing community. We have

$$\Delta q(P_{m+1}) = q_{New}(P_{m+1}) - q_{Merge}(P_{m+1}) = \frac{\mathrm{deg}(C_{k(i),m})}{(m+1)^2}$$

where $C_{k(i),m}$ is the existing community and

$$\Delta \sum_{C_{k,m+1} \in P_{m+1}} \mathrm{deg}(C_{k,m+1})^2 = -4\, \mathrm{deg}(C_{k(i),m}).$$

Notice that $\Delta E[q(P_M)]$ is almost always positive for large
complex networks. So we take the New operation for case (c) to
reduce complexity.

In summary, our algorithm takes a sequence of edges as input
and optimizes expected modularity at each step. We assign nodes to
communities according to the maximum expected modularity gain
principle. If only one node of the current edge belongs to the
existing network, we split the other node off into a new community if this
operation maximizes the expected modularity gain; otherwise we
let it join the same community as the existing node. If both nodes
of the current edge belong to the existing network but they belong to
different communities, we move one node according to the same
principle. If neither node of the current edge belongs to the existing
network, we assign them to a new community. Our
algorithm has O(M) time complexity, since each edge is processed once
in constant time. The space complexity is
O(NK) because we need to store $\mathrm{deg}(v_i, C_k)$ for calculating the
expected modularity gain in constant time. Our algorithm has two
major advantages: (1) the update only uses knowledge about the
network's local structure related to the new edge; (2) the update
can be done in constant time. Thus it can efficiently process large
networks in real time.


Results 

In this section, we present the experimental results of our online
community detection algorithm and compare it with a state-of-the-art
algorithm, the Louvain algorithm, proposed by Blondel et al. [39].
For simplicity, we use OLEM to refer to our algorithm, OLTM to
refer to a simplified version of our algorithm which greedily
optimizes temporal modularity $q(P_{m+1})$ (See Eq. (4)) instead of
expected modularity $E[q(P_M)]$ (See Eq. (6)), and Louvain to refer
to the Louvain algorithm.

The experiments use 11 public real-world large network data
sets from the Stanford Large Network Dataset Collection
(http://snap.stanford.edu/data/), which are commonly used by
researchers. Their number of nodes varies from 12,008 to 2,394,385 and
their number of edges varies from 93,439 to 4,659,565 (See
Table 1). These data sets are

• ca-CondMat: Collaboration network of Arxiv Condensed
Matter [40];

• ca-HepPh: Collaboration network of Arxiv High Energy
Physics [40];

• email-Enron: Email communication network from Enron
[41];

• ca-AstroPh: Collaboration network of Arxiv Astro Physics
[40];

• cit-HepTh: Arxiv High Energy Physics paper citation
network [42];

• cit-HepPh: Arxiv High Energy Physics paper citation
network [40];

• com-Amazon: Amazon product network with labeled
community structure [43];

• com-DBLP: DBLP collaboration network with labeled
community structure [43];

• web-Stanford: Web graph of Stanford.edu [41];

• Amazon0601: Amazon product co-purchasing network from
June 1 2003 [44];

• WikiTalk: Wikipedia talk (communication) network [45].

Table 5. NMI benchmark by three community detection algorithms compared with Yang's labeled community structure.

Data set    OLEM      OLTM      Louvain
Amazon      0.7261    0.7273    0.3118
DBLP        0.2355    0.2376    0.1958

doi:10.1371/journal.pone.0102799.t005

Figure 5. The percentage of different operations of OLTM over time. The height of each color segment represents the percentage of an
operation at a certain progress. "Op" is the abbreviation for "Operation" of OLTM at each step.

Ideally, the edges should be processed in the same order in which the
networks grew. However, these data sets do not have
timestamps on the edges. In the experiments, we process the edges
in order of their appearance in the raw files.
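
Processing edges in file order can be sketched as a small generator over the lines of an edge-list file. The sketch assumes the usual SNAP text layout (comment lines starting with `#`, then one whitespace-separated node pair per line); dropping self-loops and repeated pairs is an assumption made here to match the undirected, unweighted, loop-free setting of the Preliminaries, and the function name `iter_edges` is hypothetical.

```python
def iter_edges(lines):
    """Yield (u, v) pairs in file order, skipping comments, self-loops and repeats."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        a, b = line.split()[:2]
        u, v = int(a), int(b)
        if u == v:
            continue  # self-loop
        key = (u, v) if u < v else (v, u)
        if key in seen:
            continue  # same undirected edge listed again
        seen.add(key)
        yield u, v


sample = "# FromNodeId\tToNodeId\n1\t2\n2\t1\n3\t3\n2\t3\n"
ordered_edges = list(iter_edges(sample.splitlines()))
print(ordered_edges)  # [(1, 2), (2, 3)]
```

The same generator can be fed an open file object, so that edges are consumed one at a time without loading the whole network into memory.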

We use C# to implement our algorithms (our C# implementation
can be downloaded from
http://www.cs.zju.edu.cn/~gpan/code/pone2013.zip). For comparison, we employ the C
implementation of the Louvain algorithm provided by the authors
(https://sites.google.com/site/findcommunities/). We carry out
experiments on a Windows-based Genuine Intel(R) CPU i7 @
2.70 GHz machine with 4.00 GB memory.

Modularity and average running time (in seconds) over 10 runs
by OLEM, OLTM, and Louvain are reported in Table 2 and
Table 3. The evolution of temporal modularity over time by
OLEM and OLTM is shown in Fig. 4.

We can see that OLTM is faster than Louvain on all data sets,
and OLEM is faster than Louvain on all data sets except
ca-AstroPh, cit-HepTh and cit-HepPh. Measured by modularity,
OLEM and OLTM do not reach the level of
Louvain. This is because our algorithms are online one-pass
algorithms while Louvain is an offline multi-pass algorithm. Our
algorithms' running times are linear in the number of edges, as we
expected, while Louvain's is not, because the number of passes
of Louvain is not fixed. Most importantly, Louvain needs to start from
the beginning when a new edge is added, while our algorithms do
not.

OLTM is faster than OLEM because $\Delta q(P_{m+1})$ is simpler than
$\Delta E[q(P_M)]$. In fact, we calculate $2(m+1)^2 \Delta q(P_{m+1})$ instead of
$\Delta q(P_{m+1})$ in our implementation, as the former only involves
integer arithmetic, which is faster than floating-point arithmetic.
OLEM keeps relatively stable performance on all data sets while
OLTM has exceptionally poor performance on the email-Enron
and WikiTalk data sets. We will further investigate the underlying
cause for OLTM later. OLTM often performs slightly better than
OLEM on the other data sets. This may be due to our approximation
of expected modularity by a lower bound in OLEM.
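
The integer-arithmetic trick can be illustrated on the Move gain of case (b). The sketch below compares the exact rational gain with the version scaled by $2(m+1)^2$; the function names and argument names are introduced here for illustration, and the formula is the temporal Move-versus-Keep gain written in terms of $\mathrm{deg}(v_i, C)$, $\mathrm{deg}(v_i)$ and the two community degrees at step $m$.

```python
from fractions import Fraction


def move_gain_exact(links_to, links_from, deg_v, deg_to, deg_from, m):
    """q_Move - q_Keep as an exact fraction."""
    edge_part = Fraction(links_to - links_from + 1, m + 1)
    degree_part = Fraction(
        (deg_v + 1) * (deg_to - deg_from + deg_v + 1), 2 * (m + 1) ** 2
    )
    return edge_part - degree_part


def move_gain_scaled(links_to, links_from, deg_v, deg_to, deg_from, m):
    """2 (m+1)^2 * (q_Move - q_Keep), using integers only."""
    return (2 * (m + 1) * (links_to - links_from + 1)
            - (deg_v + 1) * (deg_to - deg_from + deg_v + 1))


# Node with degree 2, both edges inside its own community; both communities
# have degree 6 and 6 edges have been processed so far.
exact = move_gain_exact(0, 2, 2, 6, 6, 6)
scaled = move_gain_scaled(0, 2, 2, 6, 6, 6)
print(exact, scaled)  # -23/98 and -23
```

Since the scaling factor $2(m+1)^2$ is positive, the sign of the scaled integer is the same as the sign of the true gain, which is all the decision rule needs.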

As we mentioned in the Introduction Section, the modularity
optimization based approach may fail to identify communities
smaller than a certain scale, which is called the resolution limit problem
[17]. To investigate this problem, we compare the results of OLEM,
OLTM and Louvain on the com-Amazon and com-DBLP data
sets. We choose these two data sets because Yang et al. [43] released
a labeled community structure for each of them
(http://snap.stanford.edu/data/com-Amazon.html,
http://snap.stanford.edu/data/com-DBLP.html). For the com-Amazon data set, Yang et
al. labeled products from the same category as a community, and
nodes (products) that belong to a common community share a
common function or purpose. For the com-DBLP data set, they
labeled authors who published in a certain journal or conference
as a community, and nodes (authors) that belong to a common
community share a common research interest. For each data set, we
use the top 5,000 subset, the same as [43], for comparison.

Figure 6. The percentage of different operations of OLEM over time. The height of each color segment represents the percentage of an
operation at a certain progress.



We find that, although both our method and the Louvain
method optimize the modularity function, the number of
communities in Louvain's result is smaller than that in our results
(See Table 4). This is because our method and the Louvain method
achieve the optimization in different ways. The Louvain method
optimizes the modularity function by merging pairs of communities
in each pass, while our method optimizes the modularity function
by moving the nodes of the new edge at each step in order to allow
real-time processing. Generally speaking, merging communities
may obtain a higher modularity gain than moving nodes, so the
Louvain method is better than our method at optimizing
modularity. However, merging communities in each pass
reduces the number of communities in the final result, as each merging
operation eliminates one community. As a result, the
Louvain method will miss small communities.

Table 6. The statistics of modularity on 10 reordered data sets as well as modularity on original data set by our algorithm.

Data set        q(Original)    AVG(q)    MAX(q)    MIN(q)
ca-CondMat      0.6446         0.5344    0.5375    0.5298
ca-HepPh        0.5734         0.5844    0.5872    0.5823
email-Enron     0.5447         0.4730    0.4872    0.4541
ca-AstroPh      0.5418         0.5468    0.5536    0.5427
cit-HepTh       0.5885         0.5777    0.5873    0.5588
cit-HepPh       0.6278         0.6376    0.6524    0.6288
com-Amazon      0.7050         0.5903    0.5916    0.5895
com-DBLP        0.7252         0.5706    0.5715    0.5694
web-Stanford    0.8377         0.7431    0.7501    0.7385
Amazon0601      0.7785         0.5682    0.5708    0.5647
WikiTalk        0.5344         0.5102    0.5104    0.5101

doi:10.1371/journal.pone.0102799.t006

Further, the similarity between the results and the labeled
community structures can be measured by the NMI (Normalized
Mutual Information) criterion [46]. We find that, measured in
NMI, our results are more similar to the labeled community structure
than Louvain's result (See Table 5). The main reason may be that
our methods can find more small-scale communities, which the
Louvain method may have difficulty identifying.
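
For readers unfamiliar with NMI, the following Python sketch computes it for two disjoint partitions given as label lists. It assumes the common normalization $2I(A;B) / (H(A) + H(B))$ and non-overlapping communities; the exact variant used in [46] and in the experiments may differ (for example, a version for overlapping communities), and the function name `nmi` is chosen here.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information 2 I(A;B) / (H(A) + H(B)) of two labelings."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both labelings put everything in a single community
    return 2 * mutual / (h_a + h_b)


same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, independent)
```

Two partitions that differ only in the names of their communities score close to 1, while partitions that carry no information about each other score 0.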

The reason for OLTM's poor performance on the email-Enron
and WikiTalk data sets is that OLTM has no Split operation for
case (a) edges. As OLTM is a greedy approach, it only takes the
Join operation for case (a) edges to maximize temporal modularity.
Hence the only way for OLTM to create a new community is its
New operation for case (c) edges. If a data set has few case (c) edges
at its beginning, OLTM cannot create enough communities in the
early stage and obtains a poor final partition. In the worst
situation, the data set has no case (c) edge and OLTM fails. In fact,
the email-Enron and WikiTalk data sets have very few case (c) edges at
their beginning, compared with the other data sets.

In contrast, with the help of expected modularity, OLEM can
take the Split operation for case (a) edges. Hence it can create
enough communities in the early stage and obtains an acceptable
final partition on the email-Enron and WikiTalk data sets.

To compare OLTM and OLEM's operations, we plot the
percentage of different operations of OLTM and OLEM over time
in Fig. 5 and 6. We can see that OLTM generally only takes the
Join and Dense operations until a very late stage, while OLEM takes
many Split operations in the early stage on the email-Enron and
WikiTalk data sets. Therefore, OLEM's temporal modularity
increases steadily over time while OLTM's temporal modularity
remains zero until very late stages on the email-Enron and WikiTalk
data sets (See Fig. 4). In fact, OLEM can obtain an acceptable
modularity even in the early stage for the email-Enron and WikiTalk
data sets.